Abstract: We consider the classical multi-armed bandit
problem with Markovian rewards. When played, an arm
changes its state in a Markovian fashion, while it remains frozen
when not played. The player receives a state-dependent reward
each time it plays an arm. The number of states and the state 
transition probabilities of an arm are unknown to the player. 
The player's objective is to maximize its long-term total reward 
by learning the best arm over time. We show that under certain 
conditions on the state transition probabilities of the arms, a 
sample mean based index policy achieves logarithmic regret 
uniformly over the total number of trials. The result shows that 
sample mean based index policies can be applied to learning 
problems under the rested Markovian bandit model without 
loss of optimality in the order. Moreover, a comparison between
Anantharam's index policy and UCB shows that, by choosing
a small exploration parameter, UCB can have a smaller regret
than Anantharam's index policy.

I. Introduction 

In this paper we study the single player multi-armed bandit 
problem where the reward of each arm is generated by a 
Markov chain with unknown statistics, and the state of the
Markov chain evolves only when the arm is played. We will
investigate the performance of an index policy that depends 
only on the sample mean reward of an arm. 

In the classical multi-armed bandit problem, originally 
proposed by Robbins [1], a gambler (or player) must decide 
which one of the K machines (or arms) to activate (or play) 
at each discrete step in a sequence of trials so as to maximize 
his long term reward. Every time he plays an arm, he receives 
a reward (or payoff). The structure of the reward for each arm 
is unknown to the player a priori, but in most prior work the 
reward has been assumed to be independently drawn from a 
fixed (but unknown) distribution. The reward distribution in 
general differs from one arm to another, therefore the player 
must use all his past actions and observations to essentially 
"learn" the quality of these arms (in terms of their expected 
reward) so he can keep playing the best arm. 

This problem is a typical example of the trade-off between 
exploration and exploitation. On the one hand, the player 
needs to sufficiently explore or sample all arms so as to 
discover with accuracy the best arm and avoid getting stuck 
playing an inferior one erroneously believed to be the best. 
On the other hand, the player needs to avoid spending too 
much time sampling the arms and collecting statistics and 
not playing the best arm often enough to get a high return. 

Within this context, the player's performance is typically 
measured by the notion of regret. It is defined as the
difference between the expected reward that can be gained
by an "infeasible" or ideal policy, i.e., a policy that requires
either a priori knowledge of some or all statistics of the
arms or hindsight information, and the expected reward of
the player's policy. The most commonly used infeasible
policy is the best single-action policy, which is optimal among
all policies that continue to play the same arm. An ideal
policy could play for instance the arm that has the highest 
expected reward (which requires statistical information but 
not hindsight). This type of regret is sometimes also referred 
to as the weak regret, see e.g., work by Auer et al. [2]. In 
this study we will only focus on this definition of regret. 

Most studies in this area assume iid rewards, with the 
notable exception of [5]. In [3] rewards are modeled as 
single-parameter univariate densities. Under some conditions 
such as the denseness of the parameter space and continuity 
of the Kullback-Leibler number between two densities, Lai 
and Robbins [3] give a lower bound on the regret and 
construct policies that achieve this lower bound which are 
called asymptotically efficient policies. This result is
extended by Anantharam et al. in [4] to the case where playing
more than one arm at a time is allowed. Using a similar
approach, Anantharam et al. in [5] develop index policies
that are asymptotically efficient for arms with rewards driven
by finite, irreducible, aperiodic Markov chains with identical
state spaces and single-parameter families of stochastic
transition matrices. Agrawal in [6] considers sample mean
based index policies for the iid model that achieve O(log n)
regret, where n is the total number of plays, with a constant
that depends on the Kullback-Leibler number. He imposes
conditions on the index functions for which they can be
treated as upper confidence bounds and generates these index
functions for specific one-parameter families of distributions.
Auer et al. in [7] also propose sample mean based index
policies for iid rewards with bounded support; these are
derived from [6], but are simpler than those in [6]
and are not restricted to a specific family of distributions.
These policies achieve logarithmic regret uniformly over time
rather than asymptotically in time, but have a bigger constant
than that in [3]. The authors of [7] also propose randomized policies that
achieve logarithmic regret uniformly over time by using an 
exploration factor that is inversely proportional to time. 

Other works such as [8], [9] consider the iid multiarmed 
bandit problem in the multiuser setting. Players selecting 
the same arms experience collision according to a certain 
collision model. In [9] when a collision occurs on an arm, 
none of the players selecting that arm receive a reward. In 
[8] an additional collision model is considered where one of
the colliding players gets the reward. Assume that there are
M players and M is less than the number of arms K. The 
main idea underlying the policies used in such multi-user 
setting is to encourage the players to play the best M arms, 
while playing the other arms only logarithmically. 

It is worth noting that when the reward process is iid, the
question of "what happens to the arms that are not played" 
does not arise. This is because whether the unselected arms 
remain still (frozen in their current states) or transition to 
another state with a different reward is inconsequential; in 
either case the player does not obtain the reward from arms 
he does not play. Since the rewards are independently drawn 
each time, remaining still or not does not affect the reward 
the arm produces the next time it is played. This simplifies 
the problem significantly if the physical system represented 
by such a multiarmed bandit model is such that the arms 
cannot stay still (or are restless). This unfortunately is not 
the case with Markovian rewards. There is a clear difference 
between whether the arms are rested or restless. In the rested 
case, since the state is frozen when an arm is not played, the 
state we next observe the arm to be in is independent of 
how much time elapses before we play the arm again. In 
the restless case, the state of an arm continues to evolve
according to the underlying Markov law regardless of the
player's decision, but the actual state is not observed nor 
the reward obtained unless the arm is chosen by the player. 
Clearly in this case the state we next observe it to be in is now 
dependent on the amount of time that elapses between two 
plays of the same arm. This makes the problem significantly 
more difficult. To the best of our knowledge, there has been 
no study of the restless bandits in this learning context. In 
this paper we will only focus on the rested case. 

As [5] is the most closely related to the present paper, 
we now elaborate on the differences between the two. In [5]
the rewards are assumed to be generated by rested Markov
chains with transition probability matrices parametrized by
a single parameter $\theta$. This implies that for states $x, y$ in the
state space of arm $i$ the transition probability from state $x$ to $y$
is given by $p_{xy}(\theta)$, where the player knows the function
$p_{xy}(\theta)$ but not the value of $\theta$. Because of this assumption
the problem more or less reduces to a single-parameter
estimation problem. Indices are formed in [5] using natural
statistics to test the hypothesis that the rewards from an
arm are generated by a parameter value less than $\theta$ or by
$\theta$. The method also requires log-concavity of $p$ in $\theta$ as an
assumption in order to have a test statistic increasing in $\theta$. By
contrast, in the present paper we do not assume the existence
of such a single-parameter function $p$, or if it does exist it
is unknown to the player. We however do require that the
Markovian reward process is reversible. Secondly, while [5]
assumes that arms have identical state spaces, our setting
allows arms to have different state spaces. Thirdly, there are no
known recursive methods to compute the indices used in [5],
which makes the calculation hard, while our indices depend
on the sample mean of the arms, which can be computed
recursively and efficiently. Finally, the bound produced in [5]
holds asymptotically in time, while ours (also logarithmic)
holds uniformly over time. We do, however, use very useful
results from [5] in our analysis as discussed in more detail in
subsequent sections. It should be noted that our result is not
a generalization of that in [5] since the two regret bounds,
while of the same order, have different constants.
Our main results are summarized as follows. 

1) We show that when each arm is given by a finite state
irreducible, aperiodic, and reversible¹ Markov chain
with positive rewards, and under mild assumptions on
the state transition probabilities of the arms, there exist
simple sample mean based index policies that achieve
logarithmic regret uniformly over time.

2) We interpret the conditions on the state transition
probabilities in a simple model where arms are modeled as
two-state Markov chains with identical rewards.

3) We compare numerically the regret of our sample 
mean based index policy under different values of the 
exploration parameter. We also compare our policy 
with the index policy given in [5]. 

We end this introduction by pointing out another important 
class of multi-armed bandit problems solved by Gittins [10]. 
The problem there is very different from the one considered 
in this study (it was referred to as the deterministic bandit 
problem by [3]), in that the setting of [10] is such that 
the rewards are given by Markov chains whose statistics 
are perfectly known a priori. Therefore the problem is one 
of optimization rather than exploration and exploitation: the 
goal is to determine offline an optimal policy of playing the 
arms so as to maximize the total expected reward over a 
finite or infinite horizon. 

The remainder of this paper is organized as follows. In
Section II we formulate the single player rested Markovian
bandit problem and relate the expected regret with expected
number of plays from arms. In Section III we propose an
index policy based on [7] and analyze the regret of that policy
in the rested Markovian model. Discussion on this result is
given in Section IV. In Section V we give an application
that can be modeled as a rested bandit problem and evaluate
the performance of our policy for this application. Finally,
Section VI concludes the paper.

II. Problem Formulation and Preliminaries 

We assume that there are $K$ arms indexed by $i = 1, 2, \dots, K$.
The $i$th arm is modeled as an irreducible
Markov chain with finite state space $S^i$. Rewards drawn
from a state of an arm are stationary and positive. Let $r^i_x$
denote the reward obtained from state $x$ of arm $i$. Let
$P^i = \{p^i_{xy}, x, y \in S^i\}$ denote the transition probability matrix of
arm $i$. We assume the arms (i.e., the Markov chains) are
mutually independent. The mean reward from arm $i$, denoted
by $\mu^i$, is the expected reward of arm $i$ under its stationary
distribution $\pi^i = (\pi^i_x, x \in S^i)$. Then,

$$\mu^i = \sum_{x \in S^i} r^i_x \pi^i_x. \qquad (1)$$

¹ Note that reversibility is actually not necessary for our main result, i.e.,
the logarithmic regret bound, to hold, provided that we use a large deviation
bound from [12] rather than from [11] as we have done in the present paper.
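As a rough illustration of the mean reward under the stationary distribution, the sketch below approximates $\pi^i$ by repeatedly applying the transition matrix and then forms $\sum_x r^i_x \pi^i_x$. The function names and the power-iteration approach are assumptions made for illustration, not part of the paper; the example values are those of arm ch.1 of scenario S.1 in Table I.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Approximate the stationary distribution of a row-stochastic matrix P
    by iterating pi <- pi P until successive iterates agree within tol.
    (Illustrative helper; assumes the chain is irreducible and aperiodic.)"""
    size = len(P)
    pi = [1.0 / size] * size
    for _ in range(max_iter):
        nxt = [sum(pi[x] * P[x][y] for x in range(size)) for y in range(size)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def mean_reward(P, rewards):
    """Mean reward of an arm: sum over states of reward times stationary mass."""
    pi = stationary_distribution(P)
    return sum(r * p for r, p in zip(rewards, pi))


# Two-state arm with p01 = 0.3, p10 = 0.5 and rewards r0 = 1, r1 = 1.2
# (arm ch.1 of S.1 in Table I).
P_EXAMPLE = [[0.7, 0.3], [0.5, 0.5]]
R_EXAMPLE = [1.0, 1.2]
MU_EXAMPLE = mean_reward(P_EXAMPLE, R_EXAMPLE)
```

For this arm the computed mean agrees with the value 1.075 listed in Table I.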



For convenience, we will use $*$ in the superscript to denote
the arm with the highest mean. For instance,
$\mu^* = \max_{1 \le i \le K} \mu^i$. For a policy $\alpha$ we define its regret $R^\alpha(n)$ as
the difference between the expected total reward that can be
obtained by playing the arm with the highest mean and the
expected total reward obtained from using policy $\alpha$ up to time
$n$. Let $\alpha(t)$ be the arm selected by the policy $\alpha$ at $t$, and
$x_{\alpha(t)}$ the state of arm $\alpha(t)$ at time $t$. Then we have

$$R^\alpha(n) = n\mu^* - E^\alpha\left[\sum_{t=1}^{n} r^{\alpha(t)}_{x_{\alpha(t)}}\right]. \qquad (2)$$



The objective of the study is to examine how the regret
$R^\alpha(n)$ behaves as a function of $n$ for a given policy $\alpha$,
through appropriate bounding.

Note that playing the arm with the highest mean is the 
optimal policy among all single-action policies. It is, however,
not in general the optimal policy among all stationary
and nonstationary policies if all statistics are known a priori.
The optimal policy in this case (over an infinite horizon) is
the Gittins index policy, first given by Gittins in his seminal
paper [10]. In the special case where the rewards of each arm are
iid, the optimal policy over all stationary and nonstationary
policies is indeed the best single-action policy. In this study 
we will restrict our performance comparison to the best 
single-action policy. 

To proceed, below we introduce a number of preliminaries
that will be used in later analysis. The key to bounding
$R^\alpha(n)$ is to bound the expected number of plays of any
suboptimal arm. Let $T^{\alpha,i}(t)$ be the total number of times
arm $i$ is selected by policy $\alpha$ up to time $t$. We first
need to relate regret $R^\alpha(n)$ with $E^\alpha[T^{\alpha,i}(n)]$. We use the
following lemma, which is Lemma 2.1 from [5]. The proof
is reproduced here for completeness.

Lemma 1: [Lemma 2.1 from [5]] Let $Y$ be an irreducible
aperiodic Markov chain with a state space $S$, transition
probability matrix $P$, an initial distribution that is non-zero in
all states, and a stationary distribution $\{\pi_x\}, \forall x \in S$. Let $F_t$
be the $\sigma$-field generated by random variables $X_1, X_2, \dots, X_t$
where $X_t$ corresponds to the state of the chain at time $t$.
Let $G$ be a $\sigma$-field independent of $F = \vee_{t \ge 1} F_t$, the smallest
$\sigma$-field containing $F_1, F_2, \dots$. Let $\tau$ be a stopping time with
respect to the increasing family of $\sigma$-fields $\{G \vee F_t, t \ge 1\}$.
Define $N(x, \tau)$ such that

$$N(x, \tau) = \sum_{t=1}^{\tau} I(X_t = x).$$

Then $\forall \tau$ such that $E[\tau] < \infty$, we have

$$|E[N(x, \tau)] - \pi_x E[\tau]| \le C_P, \qquad (3)$$

where $C_P$ is a constant that depends on $P$.
Proof: Define $\{F_t, t \ge 1\}$ stopping times

$$\tau_k = \inf\{t > \tau_{k-1} \mid X_t = X_1\}, \quad k = 1, 2, \dots, \qquad \tau_0 = 1.$$

Because of irreducibility, $\tau_k < \infty$. Define $B_k$ as
the $k$th block. For a sample path $\omega$ it is the sequence
$(X_{\tau_{k-1}(\omega)}, X_{\tau_{k-1}(\omega)+1}, \dots, X_{\tau_k(\omega)-1})$. Then,

$$F_{\tau_k} = \sigma(B_1, \dots, B_k).$$

Let $S^* = \cup_{t \ge 1} S^t$, with the Borel $\sigma$-field of the discrete
topology. For $x \in S$ and $y = (y_1, y_2, \dots, y_t) \in S^*$, let $N(x, y)$
be the number of occurrences of $x$ in $y$ and let $l(y)$
be the length of $y$. Then by the regenerative cycle theorem the
sequence $\{B_k\}$ is i.i.d. and

$$E N(x, B_1) = \pi_x E\, l(B_1).$$

Let $T = \inf\{t > \tau \mid X_t = X_1\}$. Then $T = \tau_\kappa$ where $\kappa$ is
a stopping time of $\{F_{\tau_k}\}$ and $\{\tau_{k-1} < \tau\} \in F_{\tau_{k-1}}$. By Wald's
lemma,

$$E \sum_{t=1}^{T-1} I(X_t = x) = E \sum_{k=1}^{\kappa} N(x, B_k) = \pi_x E\, l(B_1)\, E\kappa,$$

$$E(T - 1) = E \sum_{k=1}^{\kappa} l(B_k) = E\, l(B_1)\, E\kappa.$$

Again by irreducibility, since the mean time to return to
any state starting from $X_\tau$ is finite, $E(T - \tau) \le C_P$. Then
for any $x \in S$,

$$N(x, T) - (T - \tau) \le N(x, \tau) \le N(x, T),$$
$$\pi_x E(T - 1) - C_P \le E N(x, \tau) \le \pi_x E(T - 1) + 1,$$
$$\pi_x E\tau - C_P \le E N(x, \tau) \le \pi_x E\tau + C_P,$$
$$|E N(x, \tau) - \pi_x E\tau| \le C_P. \quad ■$$



Corollary 1: For $\pi_{\min} = \min_{x \in S} \pi_x$, $C_P \le 1/\pi_{\min}$.
Proof: From Lemma 1 we have $\tau_{\kappa-1} \le \tau < \tau_\kappa$
a.s., since if for some sample path $\omega$ with nonzero measure
$\tau_{\kappa-1}(\omega) > \tau(\omega)$, then $T(\omega) \le \tau_{\kappa-1}(\omega)$, which contradicts
$\tau_\kappa > \tau_{\kappa-1}$. Since $\tau_\kappa - \tau_{\kappa-1}$ is the time for return to $X_1$,
by the irreducibility of the chain $E(\tau_\kappa - \tau_{\kappa-1}) = 1/\pi_{X_1} \le 1/\pi_{\min}$. ■
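The bound of Lemma 1 together with Corollary 1 can be checked numerically in the simplest case of a deterministic stopping time $\tau = n$ (a constant is a stopping time). The sketch below is only an illustration under that assumption: it computes $E[N(x, n)]$ exactly by propagating the state distribution, and compares it with $\pi_x n$. The chain, initial distribution and helper names are illustrative choices, not taken from the paper.

```python
def expected_visits(P, q, x, n):
    """E[N(x, n)]: expected number of visits to state x during steps 1..n
    for a chain with transition matrix P and initial distribution q."""
    size = len(P)
    dist = list(q)
    total = 0.0
    for _ in range(n):
        total += dist[x]  # probability of being in x at the current step
        dist = [sum(dist[a] * P[a][b] for a in range(size)) for b in range(size)]
    return total


# Illustrative two-state chain with stationary distribution (0.625, 0.375).
P = [[0.7, 0.3], [0.5, 0.5]]
PI = [0.625, 0.375]
Q = [0.5, 0.5]  # initial distribution, non-zero in all states


def max_deviation(n):
    """Largest |E[N(x, n)] - pi_x * n| over the states x."""
    return max(abs(expected_visits(P, Q, x, n) - PI[x] * n) for x in range(len(P)))
```

For this chain the deviation stays well below $1/\pi_{\min}$ for every horizon tried, which is consistent with the bound being a (possibly loose) constant independent of the stopping time.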

We are now ready to establish a relationship between the
regret $R^\alpha(n)$ and $E^\alpha[T^{\alpha,i}(n)]$.

Lemma 2: If the reward of each arm is given by a Markov
chain satisfying the properties of Lemma 1, then under any
policy $\alpha$ for which the expected time between two successive
samples from an arm is finite, we have

$$R^\alpha(n) \le \sum_{i=1}^{K} (\mu^* - \mu^i) E^\alpha[T^{\alpha,i}(n)] + C_{S,P,r}, \qquad (4)$$

where $C_{S,P,r}$ is a constant that depends on all the state
spaces $S^i$, transition probability matrices $P^i$, and the set of
rewards $r^i_x$, $i = 1, \dots, K$.

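Proof: Let $G^i = \vee_{j \ne i} F^j$. Since arms are independent,
$G^i$ is independent of $F^i$, where $F^i$ follows the definition
in Lemma 1 applied to the $i$th arm. Note that $T^{\alpha,i}(n)$
is a stopping time with respect to $\{G^i \vee F^i_n, n \ge 1\}$. Let
$X^i(1), \dots, X^i(T^{\alpha,i}(n))$ denote the successive states observed
from arm $i$ up to $n$. Thus, $X^i(t)$ is the $t$th observation from
arm $i$. Then the total reward obtained under policy $\alpha$ up to
time $n$ is given by:

$$\sum_{t=1}^{n} r^{\alpha(t)}_{x_{\alpha(t)}} = \sum_{i=1}^{K} \sum_{j=1}^{T^{\alpha,i}(n)} \sum_{y \in S^i} r^i_y I(X^i(j) = y).$$

By definition, the regret is
$R^\alpha(n) = n\mu^* - E^\alpha\left[\sum_{t=1}^{n} r^{\alpha(t)}_{x_{\alpha(t)}}\right]$. Therefore

$$\left| R^\alpha(n) - \left( n\mu^* - \sum_{i=1}^{K} \mu^i E^\alpha[T^{\alpha,i}(n)] \right) \right| \qquad (5)$$

$$= \left| E^\alpha\left[ \sum_{i=1}^{K} \sum_{j=1}^{T^{\alpha,i}(n)} \sum_{y \in S^i} r^i_y I(X^i(j) = y) \right] - \sum_{i=1}^{K} \sum_{y \in S^i} r^i_y \pi^i_y E^\alpha[T^{\alpha,i}(n)] \right|$$

$$\le \sum_{i=1}^{K} \sum_{y \in S^i} r^i_y \left| E^\alpha\left[ \sum_{j=1}^{T^{\alpha,i}(n)} I(X^i(j) = y) \right] - \pi^i_y E^\alpha[T^{\alpha,i}(n)] \right|$$

$$= \sum_{i=1}^{K} \sum_{y \in S^i} r^i_y \left| E^\alpha[N(y, T^{\alpha,i}(n))] - \pi^i_y E^\alpha[T^{\alpha,i}(n)] \right|$$

$$\le \sum_{i=1}^{K} \sum_{y \in S^i} r^i_y C_{P^i} = C_{S,P,r}, \qquad (6)$$

where the first inequality follows from the triangle inequality
and the fact that random variables corresponding to the states
of the Markov chains are independent of the policy $\alpha$, the
second equality follows since $T^{\alpha,i}(n)$ is a stopping time with
respect to $\{G^i \vee F^i_n, n \ge 1\}$, and the last inequality follows
from Lemma 1. Since $\sum_{i=1}^{K} E^\alpha[T^{\alpha,i}(n)] = n$, the quantity
subtracted in (5) equals $\sum_{i=1}^{K} (\mu^* - \mu^i) E^\alpha[T^{\alpha,i}(n)]$, which
gives (4). ■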

Intuitively Lemma 2 is quite easy to understand. It states
that the regret of any policy (its performance difference from
always playing the best arm) is bounded by the sum over the
expected losses from playing each non-optimal arm,
up to an additive constant.
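The decomposition in Lemma 2 can be evaluated directly once the expected play counts are available. The sketch below is an illustration only: the function name is an assumption, the means are the values of $\mu$ listed for scenario S.1 in Table I, and the play counts are hypothetical numbers chosen for the example rather than results from the paper.

```python
def regret_from_play_counts(mu, expected_plays, constant=0.0):
    """Right-hand side of the Lemma 2 bound:
    sum_i (mu* - mu_i) * E[T_i(n)] + constant."""
    mu_star = max(mu)
    return sum((mu_star - m) * t for m, t in zip(mu, expected_plays)) + constant


# Mean rewards of the five arms of S.1 (Table I).
MU_S1 = [1.075, 1.175, 1.333, 1.622, 1.100]
# Hypothetical expected play counts over n = 1000 plays (illustrative only).
PLAYS = [10, 10, 20, 950, 10]

ESTIMATE = regret_from_play_counts(MU_S1, PLAYS)
```

Only the suboptimal arms contribute to the sum, so bounding the regret reduces to bounding how often each suboptimal arm is played.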

III. An Index Policy and Its Regret Analysis 

We consider the following sample-mean based index 
policy proposed by [7], referred to as the UCB (upper 
confidence bound) policy. 

Denote by $r^i(k)$ the sample reward from arm $i$ when it is
played for the $k$th time, and by $T^i(n)$ the total number of
times arm $i$ has been played up to time $n$. Then the sample
mean reward from playing arm $i$ is given by

$$\bar{r}^i(T^i(n)) = \frac{r^i(1) + r^i(2) + \dots + r^i(T^i(n))}{T^i(n)}.$$

The policy defines an index for each
arm, denoted by $g^i_{n,T^i(n)}$ for arm $i$, and plays at each time
the arm with the highest index.
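Because the index depends on the observations only through the sample mean and the play count, it can be maintained in constant memory per arm. A minimal sketch of such a recursive update follows; the class name is an assumption made for illustration.

```python
class RunningMean:
    """Keeps the sample mean of the rewards seen so far without storing them."""

    def __init__(self):
        self.count = 0   # number of plays of the arm, T^i(n)
        self.mean = 0.0  # current sample mean of its rewards

    def update(self, reward):
        """Fold one new reward into the mean and return the updated mean."""
        self.count += 1
        self.mean += (reward - self.mean) / self.count
        return self.mean


# Example: four observed rewards from a two-state arm with rewards 1 and 1.2.
tracker = RunningMean()
for observed in [1.0, 1.2, 1.2, 1.0]:
    tracker.update(observed)
```

After the four updates the tracker holds the same value as averaging the four rewards directly.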

The index is updated as follows. Initially each arm is
played exactly once. For each arm played, its sample mean
is updated; this corresponds to the first term of the index. If
an arm is not played, then the uncertainty about the mean of
the arm is updated; this corresponds to the second term of
the index. This algorithm is illustrated below.



UCB (Upper Confidence Bound)

    Initialization: n = 1
    for (n <= K)
        play arm n;
        n = n + 1.
    while (n > K)
        $\bar{r}^i(T^i(n)) = \frac{r^i(1) + r^i(2) + \dots + r^i(T^i(n))}{T^i(n)}$, $\forall i$.
        $g^i_{n,T^i(n)} = \bar{r}^i(T^i(n)) + \sqrt{\frac{L \ln n}{T^i(n)}}$, $\forall i$.
        Play the arm with the highest index.
        n = n + 1.
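The listing above can be turned into a small simulation. The following is a sketch under stated assumptions rather than the authors' implementation: the class `RestedArm`, the function `ucb`, the choice to collect the reward of the current state before the transition, and the starting state 0 are all illustrative. The arm parameters are those of scenario S.1 in Table I.

```python
import math
import random


class RestedArm:
    """Markov arm whose state changes only when the arm is played."""

    def __init__(self, P, rewards, rng, state=0):
        self.P, self.rewards, self.rng, self.state = P, rewards, rng, state

    def play(self):
        # Assumption: the reward of the current state is collected,
        # then the chain makes one transition.
        reward = self.rewards[self.state]
        row = self.P[self.state]
        self.state = self.rng.choices(range(len(row)), weights=row)[0]
        return reward


def ucb(arms, horizon, L=2.0):
    """Play each arm once, then always play the arm with the largest
    index: sample mean + sqrt(L * ln(n) / number of plays)."""
    K = len(arms)
    plays = [0] * K
    means = [0.0] * K
    total = 0.0
    for n in range(1, horizon + 1):
        if n <= K:
            i = n - 1  # initialization: arm n is played at time n
        else:
            i = max(range(K),
                    key=lambda j: means[j] + math.sqrt(L * math.log(n) / plays[j]))
        reward = arms[i].play()
        plays[i] += 1
        means[i] += (reward - means[i]) / plays[i]  # recursive sample mean
        total += reward
    return plays, total


# Scenario S.1 of Table I: (p01, p10, r0, r1) for each of the five arms.
S1 = [(0.3, 0.5, 1, 1.2), (0.2, 0.6, 1, 1.7), (0.6, 0.3, 1, 1.5),
      (0.7, 0.2, 1, 1.8), (0.4, 0.8, 1, 1.3)]
S1_ARMS = [([[1 - a, a], [b, 1 - b]], [r0, r1]) for a, b, r0, r1 in S1]
```

A run such as `ucb([RestedArm(P, r, random.Random(0)) for P, r in S1_ARMS], 2000)` returns the play counts per arm and the total collected reward; only the played arm's state advances, which is what makes the model rested.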







In the index policy of [7] the constant L is set to 2. Below 
we will show that the regret of this policy grows at most 
logarithmically in n. To do this we will bound the expected 
number of plays of any non-optimal arm (with mean less 
than the mean of the best arm). We will use the following 
lemma by Gillman [11], which bounds the probability of a 
large deviation from the stationary distribution. 

Lemma 3: [Theorem 2.1 from [11]] Consider a finite-state,
irreducible, aperiodic and reversible Markov chain
with state space $S$, matrix of transition probabilities $P$, and
an initial distribution $q$. Let $N_q = \left\| \left( \frac{q_x}{\pi_x}, x \in S \right) \right\|_2$. Let
$\epsilon = 1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of
the matrix $P$. $\epsilon$ will be referred to as the eigenvalue gap. Let
$A \subset S$. Let $t_A(n)$ be the number of times that states in the
set $A$ are visited up to time $n$. Then for any $\gamma \ge 0$, we have

$$P(t_A(n) - n\pi_A \ge \gamma) \le \left(1 + \frac{\gamma \epsilon}{10 n}\right) N_q e^{-\gamma^2 \epsilon / 20 n}, \qquad (7)$$

where

$$\pi_A = \sum_{x \in A} \pi_x.$$

Proof: See Theorem 2.1 of [11]. ■
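To get a feel for how the bound in (7) behaves, it can be evaluated numerically. The sketch below assumes the form $(1 + \gamma\epsilon/(10n)) N_q e^{-\gamma^2 \epsilon/(20n)}$ with $N_q$ the Euclidean norm of the vector $(q_x/\pi_x)$, as stated in Lemma 3; the function names are illustrative.

```python
import math


def n_q(q, pi):
    """N_q as defined in Lemma 3: Euclidean norm of (q_x / pi_x, x in S)."""
    return math.sqrt(sum((qx / px) ** 2 for qx, px in zip(q, pi)))


def gillman_bound(n, gamma, eps, nq):
    """Upper bound (7) on P(t_A(n) - n * pi_A >= gamma)."""
    return (1 + gamma * eps / (10 * n)) * nq * math.exp(-gamma ** 2 * eps / (20 * n))


# Example: eigenvalue gap 0.8, horizon n = 1000, N_q normalized to 1.
BOUND_100 = gillman_bound(1000, 100, 0.8, 1.0)
BOUND_200 = gillman_bound(1000, 200, 0.8, 1.0)
```

The bound decays quickly in $\gamma$ for a fixed $n$, and a smaller eigenvalue gap $\epsilon$ makes the exponent smaller in magnitude, so the bound is weaker for slowly mixing chains.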

We now state the main theorem of this section. The proof 
follows similar methods as used in [7]. 

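Theorem 1: Assume all arms are modeled as finite
state, irreducible, aperiodic, and reversible Markov chains,
and assume all rewards are positive. Let
$\pi_{\min} = \min_{x \in S^i, 1 \le i \le K} \pi^i_x$, $r_{\max} = \max_{x \in S^i, 1 \le i \le K} r^i_x$,
$r_{\min} = \min_{x \in S^i, 1 \le i \le K} r^i_x$, $S_{\max} = \max_{1 \le i \le K} |S^i|$,
$\epsilon_{\max} = \max_{1 \le i \le K} \epsilon^i$, $\epsilon_{\min} = \min_{1 \le i \le K} \epsilon^i$, where $\epsilon^i$ is the
eigenvalue gap of the $i$th arm. Then using a constant
$L \ge 90 S^2_{\max} r^2_{\max} / \epsilon_{\min}$, the regret of the UCB policy can be
bounded above by

$$R(n) \le 4L \ln n \sum_{i: \mu^i < \mu^*} \frac{1}{\mu^* - \mu^i} + \sum_{i: \mu^i < \mu^*} (\mu^* - \mu^i) C^i + C_{S,P,r}, \qquad (8)$$

where

$$C^i = 1 + (D^i + D^*)\beta, \qquad D^i = \frac{|S^i|}{\pi_{\min}} \left(1 + \frac{\epsilon_{\max} \sqrt{L}}{10 |S^i| r_{\min}}\right), \qquad \beta = \sum_{t=1}^{\infty} t^{-2}.$$

Proof: Throughout the proof all quantities pertain to
the UCB policy, which will be denoted by $\alpha$ and suppressed
from the superscript whenever there is no ambiguity. Let
$\bar{r}^i(T^i(n))$ denote the sample mean of the reward collected
from arm $i$ over the first $n$ plays. Let $c_{t,s} = \sqrt{L \ln t / s}$, and
let $l$ be any positive integer. Then,

$$T^i(n) = 1 + \sum_{t=K+1}^{n} I(\alpha(t) = i)$$
$$\le l + \sum_{t=K+1}^{n} I(\alpha(t) = i, T^i(t-1) \ge l)$$
$$\le l + \sum_{t=K+1}^{n} I(\mathcal{B}(t,i), T^i(t-1) \ge l)$$
$$\le l + \sum_{t=K+1}^{n} I(\mathcal{C}(t,i))$$
$$\le l + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i=l}^{t-1} I(\bar{r}^*(s) + c_{t,s} \le \bar{r}^i(s_i) + c_{t,s_i}), \qquad (9)$$

where $\mathcal{B}(t,i)$ is the event that

$$\bar{r}^*(T^*(t-1)) + c_{t-1,T^*(t-1)} \le \bar{r}^i(T^i(t-1)) + c_{t-1,T^i(t-1)},$$

and $\mathcal{C}(t,i)$ is the event that

$$\min_{0 < s < t} (\bar{r}^*(s) + c_{t-1,s}) \le \max_{l \le s_i < t} (\bar{r}^i(s_i) + c_{t-1,s_i}).$$

We now show that $\bar{r}^*(s) + c_{t,s} \le \bar{r}^i(s_i) + c_{t,s_i}$ implies
that at least one of the following holds.

$$\bar{r}^*(s) \le \mu^* - c_{t,s} \qquad (10)$$
$$\bar{r}^i(s_i) \ge \mu^i + c_{t,s_i} \qquad (11)$$
$$\mu^* < \mu^i + 2c_{t,s_i} \qquad (12)$$

This is because if none of the above holds, then we must
have

$$\bar{r}^*(s) + c_{t,s} > \mu^* \ge \mu^i + 2c_{t,s_i} > \bar{r}^i(s_i) + c_{t,s_i},$$

which contradicts $\bar{r}^*(s) + c_{t,s} \le \bar{r}^i(s_i) + c_{t,s_i}$.

If we choose $s_i \ge 4L \ln n / (\mu^* - \mu^i)^2$, then $2c_{t,s_i} \le \mu^* - \mu^i$,
which means (12) is false, and therefore at least one
of (10) and (11) has to be true with this choice of $s_i$. We
next take $l = \left\lceil \frac{4L \ln n}{(\mu^* - \mu^i)^2} \right\rceil$, and proceed from (9). Taking
expectation on both sides gives:

$$E[T^i(n)] \le \left\lceil \frac{4L \ln n}{(\mu^* - \mu^i)^2} \right\rceil + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i=l}^{t-1} P(\bar{r}^*(s) \le \mu^* - c_{t,s}) + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i=l}^{t-1} P(\bar{r}^i(s_i) \ge \mu^i + c_{t,s_i}).$$

Consider an initial distribution $q^i$ for the $i$th arm. We have:

$$N_{q^i} = \left\| \left( \frac{q^i_y}{\pi^i_y}, y \in S^i \right) \right\|_2 \le \sum_{y \in S^i} \left\| \frac{q^i_y}{\pi^i_y} \right\|_2 \le \frac{1}{\pi_{\min}},$$

where the first inequality follows from the Minkowski inequality.
Let $n^i_y(t)$ denote the number of times state $y$ of arm $i$ is
observed up to time $t$. Then,

$$P(\bar{r}^i(s_i) \ge \mu^i + c_{t,s_i}) = P\left( \sum_{y \in S^i} r^i_y n^i_y(s_i) \ge s_i \sum_{y \in S^i} r^i_y \pi^i_y + s_i c_{t,s_i} \right)$$
$$= P\left( \sum_{y \in S^i} (-r^i_y n^i_y(s_i) + r^i_y s_i \pi^i_y) \le -s_i c_{t,s_i} \right). \qquad (13)$$

Consider a sample path $\omega$ and consider the events

$$A = \left\{ \omega : \sum_{y \in S^i} (-r^i_y n^i_y(s_i)(\omega) + r^i_y s_i \pi^i_y) \le -s_i c_{t,s_i} \right\},$$
$$B = \bigcup_{y \in S^i} \left\{ \omega : -r^i_y n^i_y(s_i)(\omega) + r^i_y s_i \pi^i_y \le -\frac{s_i c_{t,s_i}}{|S^i|} \right\}.$$

If $\omega \notin B$ then,

$$-r^i_y n^i_y(s_i)(\omega) + r^i_y s_i \pi^i_y > -\frac{s_i c_{t,s_i}}{|S^i|}, \quad \forall y \in S^i$$
$$\Rightarrow \sum_{y \in S^i} (-r^i_y n^i_y(s_i)(\omega) + r^i_y s_i \pi^i_y) > -s_i c_{t,s_i}.$$

Thus $\omega \notin A$, so $P(A) \le P(B)$. Then continuing from (13)
we have

$$P(\bar{r}^i(s_i) \ge \mu^i + c_{t,s_i}) \le \sum_{y \in S^i} P\left( -r^i_y n^i_y(s_i) + r^i_y s_i \pi^i_y \le -\frac{s_i c_{t,s_i}}{|S^i|} \right) \qquad (14)$$
$$= \sum_{y \in S^i} P\left( n^i_y(s_i) - s_i \pi^i_y \ge \frac{s_i c_{t,s_i}}{|S^i| r^i_y} \right) \qquad (15)$$
$$\le \sum_{y \in S^i} \left( 1 + \frac{\epsilon^i \sqrt{L \ln t / s_i}}{10 |S^i| r^i_y} \right) N_{q^i}\, t^{-\frac{L \epsilon^i}{20 |S^i|^2 (r^i_y)^2}} \qquad (16)$$
$$\le \frac{|S^i|}{\pi_{\min}} \left( 1 + \frac{\epsilon_{\max} \sqrt{L}}{10 |S^i| r_{\min}} \right) t^{-\frac{L \epsilon_{\min} - 10 S^2_{\max} r^2_{\max}}{20 S^2_{\max} r^2_{\max}}}, \qquad (17)$$

where (16) follows from Lemma 3. Similarly, we have

$$P(\bar{r}^*(s) \le \mu^* - c_{t,s}) = P\left( \sum_{y \in S^*} r^*_y (n^*_y(s) - s \pi^*_y) \le -s c_{t,s} \right)$$
$$\le \sum_{y \in S^*} P\left( r^*_y n^*_y(s) - r^*_y s \pi^*_y \le -\frac{s c_{t,s}}{|S^*|} \right)$$
$$= \sum_{y \in S^*} P\left( \Big( s - \sum_{x \ne y} n^*_x(s) \Big) - s \Big( 1 - \sum_{x \ne y} \pi^*_x \Big) \le -\frac{s c_{t,s}}{|S^*| r^*_y} \right)$$
$$= \sum_{y \in S^*} P\left( \sum_{x \ne y} n^*_x(s) - s \sum_{x \ne y} \pi^*_x \ge \frac{s c_{t,s}}{|S^*| r^*_y} \right)$$
$$\le \sum_{y \in S^*} \left( 1 + \frac{\epsilon^* \sqrt{L \ln t / s}}{10 |S^*| r^*_y} \right) N_{q^*}\, t^{-\frac{L \epsilon^*}{20 |S^*|^2 (r^*_y)^2}} \qquad (18)$$
$$\le \frac{|S^*|}{\pi_{\min}} \left( 1 + \frac{\epsilon_{\max} \sqrt{L}}{10 |S^*| r_{\min}} \right) t^{-\frac{L \epsilon_{\min} - 10 S^2_{\max} r^2_{\max}}{20 S^2_{\max} r^2_{\max}}}, \qquad (19)$$

where (18) again follows from Lemma 3. Then from (17)
and (19), we have

$$E[T^i(n)] \le \left\lceil \frac{4L \ln n}{(\mu^* - \mu^i)^2} \right\rceil + (D^i + D^*) \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i=1}^{t-1} t^{-\frac{L \epsilon_{\min} - 10 S^2_{\max} r^2_{\max}}{20 S^2_{\max} r^2_{\max}}}$$
$$\le \frac{4L \ln n}{(\mu^* - \mu^i)^2} + 1 + (D^i + D^*) \sum_{t=1}^{\infty} t^{-\frac{L \epsilon_{\min} - 50 S^2_{\max} r^2_{\max}}{20 S^2_{\max} r^2_{\max}}}$$
$$\le \frac{4L \ln n}{(\mu^* - \mu^i)^2} + 1 + (D^i + D^*) \beta, \qquad (20)$$

where

$$D^i = \frac{|S^i|}{\pi_{\min}} \left( 1 + \frac{\epsilon_{\max} \sqrt{L}}{10 |S^i| r_{\min}} \right), \qquad \beta = \sum_{t=1}^{\infty} t^{-2},$$

and the inequality in (20) follows from the assumption
$L \ge 90 S^2_{\max} r^2_{\max} / \epsilon_{\min}$, under which the exponent of $t$ in the
preceding sum is at most $-2$. Thus we have obtained the following
bound:

$$\sum_{i: \mu^i < \mu^*} (\mu^* - \mu^i) E[T^i(n)] \le 4L \sum_{i: \mu^i < \mu^*} \frac{\ln n}{\mu^* - \mu^i} + \sum_{i: \mu^i < \mu^*} (\mu^* - \mu^i) C^i, \qquad (21)$$

where

$$C^i = 1 + (D^i + D^*)\beta.$$

Using (21) in Lemma 2 completes the proof. ■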

IV. Discussion 

We have bounded the regret of the UCB policy uniformly 
over time by In n. While of the same order, this bound may 
be worse than the asymptotic bound given in [5] in terms of 
the constant. However, it holds uniformly over time and we 
have a very simple index based on the sample mean. The 
index policy in [5] depends on all the previous sequence 
of observations and the calculation of the index requires 
integration over the parameter space and finding an infimum 
of a function over the set of parameters. 

The bound we have in Theorem 1 depends on the stationary
distributions, the eigenvalue gaps of the arms, and the rewards.
Note that the validity of this bound relies on selecting a
sufficiently large value for the constant $L$, which requires
knowledge of the smallest eigenvalue gap. If we know a
priori (or with high confidence) that there exist $c_1$, $c_2$ and
$c_3$ such that $\epsilon_{\min} \ge c_1 > 0$, $0 < r_{\max} \le c_2$ and $S_{\max} \le c_3$,
then setting $L = 90 c_3^2 c_2^2 / c_1$ will be sufficient. While this is a
sufficient condition for the bound to hold and not necessary,
similar results under a weaker condition are not yet available.
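The sufficiency of this choice follows in one line from the condition of Theorem 1, $L \ge 90 S^2_{\max} r^2_{\max} / \epsilon_{\min}$. The step below is a worked restatement under the assumed bounds $\epsilon_{\min} \ge c_1 > 0$, $0 < r_{\max} \le c_2$ and $S_{\max} \le c_3$: replacing the numerator by something no smaller and the denominator by something no larger can only increase the ratio.

```latex
\[
\frac{90\, S_{\max}^{2}\, r_{\max}^{2}}{\epsilon_{\min}}
\;\le\;
\frac{90\, c_3^{2}\, c_2^{2}}{c_1}
\;=\; L .
\]
```

The looser the a priori bounds $c_1, c_2, c_3$ are, the larger the resulting $L$, and hence the larger the multiplicative factor $4L$ in front of the logarithmic term of the bound.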

Selecting a large L will increase the magnitude of the 
exploration component of the index which depends only 
on the current time and the total number of times the 
corresponding arm has been selected up to the current time. 
This means that the rate of exploration will increase, but the 
regret will remain logarithmic with time. The only things 
that change are the constant and the multiplicative factor of 
the logarithmic term of the bound. 

In general the eigenvalue gap is a complex expression of
the components of the stochastic matrix. It can be simplified
in special cases. In the next section we give an example of the
index policy.

V. An Example 

Consider a player who plays one of K machines at each 
time. Each machine can be in one of two states "1" and "0" 
and is modeled as an irreducible and aperiodic Markov chain. 
This requirement along with time reversibility is satisfied 
if $p^i_{00} > 0$, $p^i_{11} > 0$, $i = 1, 2, \dots, K$. The stationary
distribution of machine $i$ is

$$\pi^i = (\pi^i_0, \pi^i_1) = \left( \frac{p^i_{10}}{p^i_{10} + p^i_{01}}, \frac{p^i_{01}}{p^i_{10} + p^i_{01}} \right),$$

and the eigenvalue gap is

$$\epsilon^i = p^i_{10} + p^i_{01}.$$

TABLE I
Parameters of the arms for S.1 and S.2

S.1     $p_{01}, p_{10}$    $r_0, r_1$    $\pi_1, \mu$
ch.1    .3, .5              1, 1.2        .3750, 1.075
ch.2    .2, .6              1, 1.7        .2500, 1.175
ch.3    .6, .3              1, 1.5        .6667, 1.333
ch.4    .7, .2              1, 1.8        .7778, 1.622
ch.5    .4, .8              1, 1.3        .3333, 1.100

S.2     $p_{01}, p_{10}$    $r_0, r_1$    $\pi_1, \mu$
ch.1    .0001, .9975        1, 2          .0001, 1.000
ch.2    .0010, .9900        1, 2          .0010, 1.001
ch.3    .3430, .5100        1, 2          .4021, 1.402
ch.4    .1250, .7500        1, 2          .1429, 1.143
ch.5    .0270, .9100        1, 2          .0288, 1.029

Fig. 1. Regret of UCB for S.1 (legend: UCB, L=1500; UCB, L=2)

Figures 1 (S.1) and 4 (S.2) show the simulation results
for 5 arms averaged over 100 runs with parameters given
in Table I. $90 S^2_{\max} r^2_{\max} / \epsilon_{\min}$ is 1458 for S.1 and 1688.2 for
S.2.
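These two thresholds can be reproduced from Table I. The sketch below assumes the two-state eigenvalue gap $\epsilon^i = p^i_{01} + p^i_{10}$ and $S_{\max} = 2$; the function name and the tuple layout `(p01, p10, r0, r1)` are illustrative choices.

```python
def exploration_threshold(arms, s_max=2):
    """90 * S_max^2 * r_max^2 / eps_min for two-state arms given as
    tuples (p01, p10, r0, r1), using eps = p01 + p10 as the eigenvalue gap."""
    eps_min = min(p01 + p10 for p01, p10, _, _ in arms)
    r_max = max(max(r0, r1) for _, _, r0, r1 in arms)
    return 90 * s_max ** 2 * r_max ** 2 / eps_min


# Arm parameters from Table I.
S1 = [(0.3, 0.5, 1, 1.2), (0.2, 0.6, 1, 1.7), (0.6, 0.3, 1, 1.5),
      (0.7, 0.2, 1, 1.8), (0.4, 0.8, 1, 1.3)]
S2 = [(0.0001, 0.9975, 1, 2), (0.0010, 0.9900, 1, 2), (0.3430, 0.5100, 1, 2),
      (0.1250, 0.7500, 1, 2), (0.0270, 0.9100, 1, 2)]

THRESHOLD_S1 = exploration_threshold(S1)
THRESHOLD_S2 = exploration_threshold(S2)
```

For S.1 the smallest gap is 0.8 and the largest reward is 1.8; for S.2 the smallest gap is 0.853 (arm ch.3) and the largest reward is 2.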

In Figure 1 we see that the performance is better when
$L = 2$ (which violates our sufficient condition), compared to
$L = 2000 > 90 S^2_{\max} r^2_{\max} / \epsilon_{\min}$, which satisfies the condition
of Theorem 1. The bound from Theorem 1 in this case is
$45150 \ln n + 62.8$. As $\epsilon$ becomes smaller the Gillman bound
becomes looser. However, our results for the two-state arms
suggest that even when $L$ is small (e.g., $L = 2$) compared to
$90 S^2_{\max} r^2_{\max} / \epsilon_{\min}$, the UCB policy works well (and indeed
better as suggested by our numerical results); the resulting
regret is at most logarithmic with $n$. This seems to suggest
that UCB's regret can be bounded logarithmically under any
value of $L$ for two-state irreducible Markov chains.

We next compare the performance of UCB with the
index policy given in [5] assuming the player is restricted
to playing one arm at a time. We generate the transition
probability functions parametrized by $\theta \in [0, 10]$, satisfying
the conditions in [5]. The parameter set is $\theta = [0.5, 1, 7, 5, 3]$
where the $i$th element corresponds to the parameter of arm $i$.
Moreover, $\mu(\theta)$ is increasing in $\theta$, and $p_{01}(\theta)$ and $p_{10}(\theta)$ are
log-concave in $\theta$ by letting

$$p_{10}(\theta) = 1 - \left(\frac{\theta}{10}\right)^2,$$

$$p_{01}(\theta) = \left(\frac{\theta}{10}\right)^3.$$
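As a consistency check, the S.2 rows of Table I can be regenerated from the parameter values. The sketch below reads the two transition functions as $p_{10}(\theta) = 1 - (\theta/10)^2$ and $p_{01}(\theta) = (\theta/10)^3$, and uses the two-state stationary probability $\pi_1 = p_{01}/(p_{01} + p_{10})$; the function names are illustrative.

```python
def p10(theta):
    """Transition probability from state 1 to state 0."""
    return 1 - (theta / 10) ** 2


def p01(theta):
    """Transition probability from state 0 to state 1."""
    return (theta / 10) ** 3


def pi1(theta):
    """Stationary probability of state 1 for the two-state chain."""
    return p01(theta) / (p01(theta) + p10(theta))


THETAS = [0.5, 1, 7, 5, 3]  # parameter of arm i, i = 1..5
ROWS = [(p01(t), p10(t), pi1(t)) for t in THETAS]
```

For example, $\theta = 7$ gives $p_{01} = 0.343$, $p_{10} = 0.51$ and $\pi_1 \approx 0.4021$, matching the ch.3 row of S.2; since the rewards are 1 and 2, the mean reward is $1 + \pi_1$.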

Any policy for which $R^\alpha(n) = o(n^\gamma)$ for every $\gamma > 0$ is
called a uniformly good policy by [5]. It was shown that for
any uniformly good policy $\alpha$, $\liminf_{n \to \infty} R^\alpha(n) / \ln n$ is bounded
below by a constant that depends on the arm parameters (22),
and that the index policy $\alpha^*$ in [5] satisfies the matching upper
bound on $\limsup_{n \to \infty} R^{\alpha^*}(n) / \ln n$.

Figure 2 shows $p_{01}$, $p_{10}$, $\mu$ we used in this set of
experiments; Figure 3 shows $\ln p_{01}$, $\ln p_{10}$ as functions of $\theta$.
Figure 4 compares the regret of the index policy of [5]
(labeled as Anantharam's policy in the figure) with UCB
under different values of $L$. Note that the index policy of
[5] assumes the knowledge of $p_{01}(\theta)$ and $p_{10}(\theta)$, while in
UCB these functions are unknown to the player. Simulation
for $L = 1500 > 90 S^2_{\max} r^2_{\max} / \epsilon_{\min}$ satisfies the sufficient
condition in Theorem 1 for the bound to hold. The bound
from Theorem 1 for this case is $39846 \ln n + 45$, while
Anantharam's bound is $4.406 \ln n$.

The first thing to note is the gap between the bound we
derived for UCB and the bound of [5] given in (22). The
second thing to note is that for $L = 0.05$ UCB has smaller
regret than the index policy of [5], as well as the bound
in (22), for the given time horizon. Note that [5] proved
that the performance of any uniformly good policy cannot be
better than the bound in (22) asymptotically. Since uniformly
good policies have the minimum growth of regret among all
policies, this bound also holds for UCB. This however is
not a contradiction because this bound holds asymptotically;
we indeed expect the regret of UCB with $L = 0.05$ to be
very close to this bound in the limit. These results show
that while the bound in [5] is better than the bound we
proved for UCB in this paper, in reality the UCB policy
can perform very close to the tighter bound (uniformly, not
just asymptotically).

VI. Conclusion 

In this study we considered the multi-armed bandit problem
with Markovian rewards, and proved that a sample
mean based index policy achieves logarithmic regret uniformly
over time provided that an exploration constant is
sufficiently large with respect to the eigenvalue gaps of the
stochastic matrices of the arms. An example was presented
for a special case of two-state Markovian reward models.
Numerical results suggest that in this case order optimality
of the index policy holds even when the sufficient condition
on the exploration constant does not hold.

Fig. 2. $p_{01}$, $p_{10}$, $\mu$ as functions of $\theta$

Fig. 3. $\ln p_{01}$, $\ln p_{10}$ as functions of $\theta$

Fig. 4. Regrets of UCB and Anantharam's policy for S.2



